fix: prevent duplicate daily recap notifications via atomic SETNX lock (#4594)#4646
fix: prevent duplicate daily recap notifications via atomic SETNX lock (#4594)#4646
Conversation
Replace has_daily_summary_been_sent check with try_acquire_daily_summary_lock, move in-function imports to module top-level, and remove post-LLM set_daily_summary_sent call (lock now handles dedup). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
9 tests covering SETNX lock behavior, concurrent access simulation, and _send_summary_notification lock integration (skip/proceed/no-convos). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
There was a problem hiding this comment.
Code Review
This pull request effectively addresses a critical race condition that caused duplicate daily summary notifications. The fix correctly replaces a non-atomic check-then-set logic with an atomic SETNX operation in Redis, ensuring that only one job instance can proceed with the expensive LLM call. The changes are clean, and the removal of unused functions and moving of imports improve code quality. The addition of a new test suite for this change is commendable and covers various scenarios, including concurrency. However, there is a critical flaw in the new test file backend/tests/unit/test_daily_summary_race_condition.py: it tests a copy of the try_acquire_daily_summary_lock function instead of the actual production code. This undermines the validity of the unit tests for the lock function itself. I've added a critical review comment with a suggestion on how to fix this to ensure the tests are robust and correctly validate the implementation.
| def try_acquire_daily_summary_lock(uid: str, date: str, ttl: int = 60 * 60 * 2) -> bool: | ||
| result = mock_r.set(f'users:{uid}:daily_summary_lock:{date}', '1', ex=ttl, nx=True) | ||
| return result is not None | ||
|
|
||
|
|
||
| redis_db_mod.try_acquire_daily_summary_lock = try_acquire_daily_summary_lock |
There was a problem hiding this comment.
This test is testing a local, copy-pasted version of try_acquire_daily_summary_lock, not the actual implementation from database/redis_db.py. This means that if the logic in the original file changes, this test might still pass, giving a false sense of security. The test should import and test the production code directly.
To fix this, you should import the actual function and mock its dependencies. Given the extensive module stubbing in this file, a robust way to achieve this is to patch the redis.Redis client at the top of the file before any application modules are imported. This will prevent the real Redis client from being instantiated.
Example of how you could refactor the test setup:
# At the top of your test file
from unittest.mock import MagicMock, patch
mock_r = MagicMock()
patch('redis.Redis', return_value=mock_r).start()
# Now you can safely import from your application modules
# You might need to adjust the existing module stubbing
from database.redis_db import try_acquire_daily_summary_lock
# ...
class TestTryAcquireDailySummaryLock:
"""Tests for the atomic SETNX lock function."""
def test_lock_acquired_returns_true(self):
mock_r.reset_mock()
mock_r.set.return_value = True
assert try_acquire_daily_summary_lock('uid1', '2026-02-07') is True
mock_r.set.assert_called_with('users:uid1:daily_summary_lock:2026-02-07', '1', ex=7200, nx=True)
# ... other tests for try_acquire_daily_summary_lockThis change is critical to ensure the tests are correctly validating the production code.
- test_redis_error_propagates_no_silent_swallow: ConnectionError not silently swallowed - test_utc_fallback_still_acquires_lock: user_data without timezone still hits lock Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
Test results:
11 new tests added (
Please review when ready. by AI for @beastoin |
Integration Test ConfirmationRan the notification job against dev environment to validate the SETNX race condition fix end-to-end. Setup
Run 1 — Lock Acquisition
Run 2 — Duplicate Prevention (simulating overlapping cron)
Redis VerificationConclusionThe atomic SETNX lock prevents duplicate daily summaries when overlapping cron instances hit the same user. Before this fix, ~51% of users received duplicates (per @mon's BQ analysis in #4594). The race window between check-and-set (~3 min for LLM work) is now eliminated — the lock is acquired before any expensive work begins. Integration test by AI for @beastoin |
|
Deployment status: Live in prod. Deployed via @mon can you check BQ ~24h after deployment to verify duplicate daily summaries are no longer occurring? The previous analysis showed ~51% duplication rate, so a clean day would confirm the fix. by AI for @beastoin |
Summary
has_daily_summary_been_sent/set_daily_summary_sent) with atomictry_acquire_daily_summary_lockusing RedisSETNX— lock is acquired BEFORE the expensive LLM call, not aftergenerate_comprehensive_daily_summary,daily_summaries_db) to module top-level per codebase rulesset_daily_summary_sent/has_daily_summary_been_sentfunctions fromredis_db.pyFixes #4594
Root cause
Cloud Scheduler triggers the notification cron every 1 minute. The ~3 minute LLM generation gap between
has_daily_summary_been_sentcheck (line 119) andset_daily_summary_sentset (line 162) allowed multiple job instances to pass the check before any completed. Per @mon's BQ analysis, ~51% of users received duplicate daily summaries, roughly doubling the daily_summary LLM cost.Fix
Same pattern as
try_acquire_listen_lock()(redis_db.py:744) —r.set(key, '1', ex=ttl, nx=True)returnsTrueonly for the first caller. All subsequent callers getNoneand return early.Test plan
backend/test.shpasses (all existing tests unaffected)daily_summarycall distribution after deploy — expect ~100% single-call per user🤖 Generated with Claude Code